What is the optimal price point per targeted group, and how is it determined? To answer this question, a simulation will be created using distributions and parameters derived from a Kaggle dataset and findings from published journals.
Variables:
How is N, the number of surveys, determined? (for the future simulation) N is the sample size needed.
N Approved Conversions = (Approved Conversions / Clicks)_g × Clicks_g(Interest Group_g × Impressions)
Expected Cost for the client = N Approved Conversions + Cost of Ads + (Time × Incentive)
N Approved Conversions needed = N Approved Conversions obtained × Time × Incentive (Bayesian inference); Rate of Approved Conversion = Approved Conversions / (Time × Incentive)
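As a rough sketch, the expected-cost relationship above can be written out in R. Every value in the example call is a made-up placeholder, not an estimate:

```r
# Hypothetical sketch of the client's expected cost, following the
# relationship above: conversions needed + ad spend + (time x incentive).
expected_client_cost <- function(n_conversions, ad_cost, time, incentive) {
  n_conversions + ad_cost + (time * incentive)
}

# e.g. 100 approved conversions, $500 of ads, 30 days at a $10 incentive
expected_client_cost(n_conversions = 100, ad_cost = 500, time = 30, incentive = 10)
```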
Characteristics from this dataset will be used in parameterizing the model.
Downloaded from https://www.kaggle.com/loveall/clicks-conversion-tracking/version/1
The data used in this project are from an anonymous organisation's social media ad campaign. The data file can be downloaded from the link above. The file conversion_data.csv contains 1143 observations of 11 variables. Below are the descriptions of the variables.
1.) ad_id: a unique ID for each ad. 2.) xyz_campaign_id: an ID associated with each ad campaign of XYZ company. 3.) fb_campaign_id: an ID associated with how Facebook tracks each campaign. 4.) age: age of the person to whom the ad is shown. 5.) gender: gender of the person to whom the ad is shown. 6.) interest: a code specifying the category to which the person's interest belongs (interests are as mentioned in the person's Facebook public profile). 7.) Impressions: the number of times the ad was shown. 8.) Clicks: number of clicks on that ad. 9.) Spent: amount paid by company xyz to Facebook to show that ad. 10.) Total conversion: total number of people who enquired about the product after seeing the ad. 11.) Approved conversion: total number of people who bought the product after seeing the ad.
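As a quick orientation, the 11-variable structure described above can be sketched with a couple of made-up rows (the values below are illustrative, not taken from the real file):

```r
# Two made-up rows mirroring the structure of conversion_data.csv;
# the values are placeholders, not real observations.
KAG_example <- data.frame(
  ad_id               = factor(c("708746", "708749")),
  xyz_campaign_id     = factor(c("1", "1")),
  fb_campaign_id      = factor(c("1", "2")),
  age                 = factor(c("30-34", "30-34")),
  gender              = factor(c("M", "F")),
  interest            = factor(c("4", "5")),
  Impressions         = c(289, 420),
  Clicks              = c(2, 3),
  Spent               = c(47, 60),
  Total_Conversion    = c(2, 1),
  Approved_Conversion = c(1, 1)
)

ncol(KAG_example)  # 11 variables, as described above
```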
ad_id is labeled numerically but is nominal, meaning it's just a name. The major gaps between consecutive id numbers make it apparent that much of the data is missing.
summary(KAG)
## ad_id xyz_campaign_id fb_campaign_id age gender
## 708746 : 1 1: 54 420 : 6 30-34:426 F:551
## 708749 : 1 2:464 433 : 6 35-39:248 M:592
## 708771 : 1 3:625 467 : 6 40-44:210
## 708815 : 1 477 : 6 45-49:259
## 708818 : 1 502 : 6
## 708820 : 1 538 : 6
## (Other):1137 (Other):1107
## interest Impressions Clicks Spent
## 5 :140 Min. : 1.0 Min. : 1.00 Min. : 1.0
## 3 : 85 1st Qu.: 274.5 1st Qu.: 2.00 1st Qu.: 52.0
## 17 : 77 Median : 559.0 Median : 9.00 Median :298.0
## 15 : 60 Mean : 560.7 Mean : 30.05 Mean :335.2
## 4 : 51 3rd Qu.: 844.5 3rd Qu.: 38.50 3rd Qu.:583.5
## 16 : 51 Max. :1130.0 Max. :183.00 Max. :869.0
## (Other):679
## Total_Conversion Approved_Conversion
## Min. : 1.000 Min. : 1.000
## 1st Qu.: 2.000 1st Qu.: 1.000
## Median : 2.000 Median : 2.000
## Mean : 3.802 Mean : 1.933
## 3rd Qu.: 4.000 3rd Qu.: 2.000
## Max. :32.000 Max. :16.000
##
head(KAG)
## # A tibble: 6 x 11
## ad_id xyz_campaign_id fb_campaign_id age gender interest Impressions
## <fct> <fct> <fct> <fct> <fct> <fct> <dbl>
## 1 7087~ 1 1 30-34 M 4 289
## 2 7087~ 1 2 30-34 M 5 420
## 3 7087~ 1 3 30-34 M 8 50
## 4 7088~ 1 4 30-34 M 16 213
## 5 7088~ 1 4 30-34 M 16 212
## 6 7088~ 1 5 30-34 M 17 130
## # ... with 4 more variables: Clicks <dbl>, Spent <dbl>,
## # Total_Conversion <dbl>, Approved_Conversion <dbl>
ecdf(as.numeric(as.character(KAG$ad_id))) %>% plot()
max(as.numeric(as.character(KAG$ad_id))) - min(as.numeric(as.character(KAG$ad_id)))
## [1] 605669
The graph shows that massive amounts of data are missing: only 1,143 observations are given out of an ad_id range of 605,669. The data only include observations with at least 1 Click, 1 Total Conversion, and 1 Approved Conversion, so we have no way to estimate how often our predictor variables did not produce a click; comparing the range to what we have, we can only guess that this happened roughly 500 times more often.
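A quick back-of-envelope check on that sparsity, using the two figures above:

```r
# How sparse is the observed slice of the ad_id range?
n_observed <- 1143
id_range   <- 605669

round(id_range / n_observed)            # ~530 ids in the range per observed ad
round(n_observed / id_range * 100, 2)   # ~0.19% of the range is observed
```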
This is not an advertising campaign per se, but the ad cost model will be based on Cost per 1,000 Impressions. The data appear to be based on some combination of cost per click and cost per impression: some Spent values are 1 when Clicks is 1, while Impressions can be any number. I believe the data are summarized, which makes building a regression model far more difficult.
Ads are ranked based on whom they target; a higher bid does not guarantee that the bid will win. Facebook adjusts bids based on our budget and the time left, and determines which ad wins the auction by how relevant it is to the person who will see it. This works the same way on Instagram, Messenger, and the Audience Network. Ads are ranked by Total Value, a number calculated from multiple factors, including the bid, predictions of how the audience will react, and how relevant the audience finds the ad. Account relevance and ad quality make the ad competitive:
[Advertiser Bid] x [Estimated Action Rates] + [User Value] = Total Value
[Advertiser Bid] = bid strategy, bid cap or cost target
[Estimated Action Rates] = optimization event, based on past historical performance. Advertiser values are converted to an effective CPM (Cost per 1,000 Impressions): the total amount spent divided by Impressions, times 1,000.
Optimizing for impressions: eCPM = (advertiser bid per 1,000 impressions). Optimizing for clicks: eCPM = (advertiser bid per click) × (estimated click-through rate) × 1,000. Optimizing for actions: eCPM = (advertiser bid per action) × (estimated click-through rate) × (estimated conversion rate) × 1,000.
| Ad type | Bid | Outcome | eCTR | eCVR | per 1,000 impressions served | eCPM (dollars) |
|---|---|---|---|---|---|---|
| Ad 1 | $10 | per 1,000 impressions | | | 1 | 10 |
| Ad 2 | $1 | per one (1) click | 1% | | 1,000 | 10 |
| Ad 3 | $100 | per one (1) conversion | 1% | 1% | 1,000 | 10 |
eCTR = Clicks/Impressions, and eCVR = (Approved_Conversions/Clicks) × 100.
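The three normalizations above can be written as small helper functions; the example calls reproduce the ads from the table, all of which normalize to a $10 eCPM:

```r
# Effective CPM under each optimization event
# (bids in dollars, rates as proportions)
ecpm_impressions <- function(bid_per_1000_impressions) {
  bid_per_1000_impressions
}
ecpm_clicks <- function(bid_per_click, ectr) {
  bid_per_click * ectr * 1000
}
ecpm_actions <- function(bid_per_action, ectr, ecvr) {
  bid_per_action * ectr * ecvr * 1000
}

ecpm_impressions(10)           # Ad 1: $10 per 1,000 impressions
ecpm_clicks(1, 0.01)           # Ad 2: $1 per click at a 1% eCTR
ecpm_actions(100, 0.01, 0.01)  # Ad 3: $100 per conversion at 1% eCTR and 1% eCVR
```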
Impressions and clicks are what will be used to predict spending, so these variables need to be randomized in our simulation.
rvariables <- KAG %>%
group_by(age, gender, interest) %>%
summarize(mImp = mean(Impressions),
vImp = sqrt(var(Impressions)),
mClicks = mean(Clicks),
vClicks = sqrt(var(Clicks)),
CPM = round(mean(Spent/Impressions)*10,3))
ggplot(rvariables, aes(x=mImp)) +
geom_histogram(aes(y=..density..), colour="black", fill="white")+
geom_density(alpha=.2, fill="#FF6666")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(rvariables, aes(x=mClicks)) +
geom_histogram(aes(y=..density..), colour="black", fill="white")+
geom_density(alpha=.2, fill="#FF6666")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(rvariables, aes(x=CPM)) +
geom_histogram(colour="black", fill="white")+
geom_density(alpha=.2, fill="#FF6666")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Impressions appear to be roughly normally distributed. Clicks appear Poisson or negative binomially distributed. CPM looks like it could be normally distributed but left-skewed. NOTE: remember that these data only contain ads that resulted in at least 1 click and 1 approved conversion, so they will heavily favor ads that were not seen often, since those also resulted in a conversion.
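Those candidate shapes can be eyeballed by simulating from them. The parameters below are illustrative picks near the summary statistics above (the standard deviation and negative binomial size are guesses, not fitted values):

```r
set.seed(1)
n <- 1000

# Roughly normal impressions (mean ~561 from the summary; sd is a guess)
sim_impressions <- rnorm(n, mean = 561, sd = 280)

# Overdispersed click counts via the negative binomial (mu ~30 from the summary)
sim_clicks <- rnbinom(n, size = 1, mu = 30)

summary(sim_impressions)
summary(sim_clicks)
```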
The number of Impressions is not entirely randomly distributed. Impressions depend on Facebook's relevance score: Facebook will only show the ad based on how well previous ads aligned with their viewers, and the relevance score determines how many Impressions Facebook will allow the ad to have. This means that using a random number of impressions for different groups doesn't work, since each group will have a different relevance score.
Consider the following data from one specific group and observe how the Spent variable fluctuates. We can see a large variability in Impressions and the cost to reach this group.
KAG %>%
select(age, gender, interest, Impressions, Clicks, Spent) %>%
mutate(Cost_Click = Spent/Clicks,
CTR = (Clicks/Impressions)*1000,
bid = Spent/CTR,
CPM = round((Spent/Impressions)*10,3)) %>%
filter(age == "30-34",
gender == "M",
interest == "4")
## # A tibble: 15 x 10
## age gender interest Impressions Clicks Spent Cost_Click CTR bid
## <fct> <fct> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 30-34 M 4 289 2 47 23.5 6.92 6.79
## 2 30-34 M 4 405 4 176 44 9.88 17.8
## 3 30-34 M 4 392 4 140 35 10.2 13.7
## 4 30-34 M 4 154 1 1 1 6.49 0.154
## 5 30-34 M 4 150 1 1 1 6.67 0.15
## 6 30-34 M 4 281 1 1 1 3.56 0.281
## 7 30-34 M 4 508 7 275 39.3 13.8 20.0
## 8 30-34 M 4 435 3 122 40.7 6.90 17.7
## 9 30-34 M 4 232 1 1 1 4.31 0.232
## 10 30-34 M 4 116 1 1 1 8.62 0.116
## 11 30-34 M 4 496 2 75 37.5 4.03 18.6
## 12 30-34 M 4 1117 110 805 7.32 98.5 8.17
## 13 30-34 M 4 1130 179 869 4.85 158. 5.49
## 14 30-34 M 4 495 2 64 32 4.04 15.8
## 15 30-34 M 4 938 31 558 18 33.0 16.9
## # ... with 1 more variable: CPM <dbl>
The relevance score is a rating from 1 to 10 that estimates how well your target audience is responding to your ad. This score is shown after your ad receives more than 500 impressions; the higher the score, the better. Calculating a relevance score takes about 3-7 days, depending on your budget. The ad algorithm usually takes about 7 days to kick in and start its learning phase so the auction can work more effectively.
The Relevance Score measures the quality and engagement level of the ads. It determines the cost per click and how frequently Facebook shows the ad, and it is primarily based on expected positive and negative feedback, with negative feedback having about 10x the impact of positive feedback. Facebook will retire relevance scores on April 30th and replace them with a set of quality and engagement metrics; these metrics will depend heavily on the quality of the ad visuals and message.
Facebook calculates a relevance score after 500 impressions, so we will begin with 500 random observations.
theory <- tibble(relevance = round(runif(500, min = 0, max = 10), 1),
cost = (0))
theory <- theory %>%
mutate(cost = ifelse(relevance >=8, abs(rnorm(length(relevance), 11-relevance, 2)),
ifelse(relevance < 8 & relevance >4, abs(rnorm(length(relevance), 11-relevance, 2.5))+1, abs(rnorm(length(relevance), 11-relevance, 2)+2))),
quality = ifelse(relevance >=8, abs(rnorm(length(relevance), relevance, 1.8)),
ifelse(relevance < 8 & relevance >4, abs(rnorm(length(relevance), relevance, 1.5)), abs(rnorm(length(relevance), relevance, 1.2)))),
group = ifelse(relevance >= 8, "1",
ifelse(relevance < 8 & relevance >4, "2", "3")),
quality = ifelse(quality >10, 10, quality))
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
plot_ly(theory, x = ~relevance, y = ~quality, z = ~cost, color = ~group) %>%
add_markers()
The model that I have in mind looks like this:
xexp <- log(10, base = exp(1.3))
customGreen0 = "#DeF7E9"
customGreen = "#71CA97"
customRed = "#ff7f7f"
Simulation <- function(days = 30,
daily_impressions=2000,
mean_relevance = 7.2,
sd_relevance = .8,
relevance_filter = 7,
probability_sd = 0.024,
cost_sd_rh = 1.75,
cost_sd_rm = 2,
cost_sd_rl = 2,
dollar_incentive_min = 2,
dollar_incentive_max = 20,
quality_reduction_slope = -1/25,
quality_reduction_intercept = 1.4
){
Simulations <- list()
for(i in 1:days){
t_sim <- tibble(relevance = round(rnorm(daily_impressions, mean = mean_relevance, sd = sd_relevance), 1),
cost = ifelse(relevance >=8, abs(rnorm(length(relevance), 11-relevance, cost_sd_rh)),
ifelse(relevance < 8 & relevance >4, abs(rnorm(length(relevance), 11-relevance, cost_sd_rm))+1, abs(rnorm(length(relevance), 11-relevance, cost_sd_rl)+2))),
quality = ifelse(relevance >=8, abs(rnorm(length(relevance), relevance, 1.8)),
ifelse(relevance < 8 & relevance >4, abs(rnorm(length(relevance), relevance, 1.5)), abs(rnorm(length(relevance), relevance, 1.2)))),
group = ifelse(relevance >= 8, "1",
ifelse(relevance < 8 & relevance >4, "2", "3"))) %>%
filter(relevance >= relevance_filter)
Simulations[[i]] <- t_sim %>%
mutate(base_prob = (relevance - relevance_filter)/100,
prob_variation = abs(round(rnorm(nrow(t_sim), base_prob, probability_sd),3)),
prob_variation = ifelse(prob_variation <= 0, round(runif(nrow(t_sim), .001, .009),3),prob_variation),
incentive = round(runif(nrow(t_sim), dollar_incentive_min,dollar_incentive_max)),
inc_prob = prob_variation * log(incentive, base = exp(xexp)),
survey = rbinom(nrow(t_sim), 1, inc_prob),
qualdecr = quality_reduction_slope*incentive+quality_reduction_intercept,
quality_var = quality *qualdecr) %>%
select(relevance, cost, quality_var, incentive, inc_prob, survey) %>%
mutate(Ad_Cost = round(cost, 3),
Response_Quality = round(quality_var, 3),
Survey_Probability = round(inc_prob*100, 3),
Dollar_Incentive = incentive) %>%
rename(FB_Relevance_Score = relevance, Survey_Completed = survey) %>%
select(FB_Relevance_Score, Ad_Cost, Dollar_Incentive, Response_Quality, Survey_Probability, Survey_Completed) %>%
arrange(desc(FB_Relevance_Score))
}
return(do.call(bind_rows, list(.id = "Day", Simulations)))
}
Simulation_Summary <- function(sim, by = "all"){
byday <- sim %>%
mutate(Day = as.numeric(Day)) %>%
group_by(Day) %>%
summarize(Total_Impressions = n(),
Surveys_Completed = sum(Survey_Completed),
Mean_Probability = round(mean(Survey_Probability),2),
Mean_Incentive = round(mean(ifelse(Survey_Completed==1, Dollar_Incentive, NA), na.rm = TRUE),2),
Total_Incentive_Cost = sum(ifelse(Survey_Completed==1, Dollar_Incentive, 0)),
Mean_Relevance = round(mean(FB_Relevance_Score),2),
Mean_Response_Quality = round(mean(Response_Quality), 2)) %>%
arrange(Day) %>%
mutate(Day = as.character(Day))
total <- sim %>%
summarize(Day = paste0(nrow(byday), " Days"),
Total_Impressions = n(),
Surveys_Completed = sum(Survey_Completed),
Mean_Probability = round(mean(Survey_Probability),2),
Mean_Incentive = round(mean(ifelse(Survey_Completed==1, Dollar_Incentive, NA), na.rm = TRUE),2),
Total_Incentive_Cost = sum(ifelse(Survey_Completed==1, Dollar_Incentive, 0)),
Mean_Relevance = round(mean(FB_Relevance_Score),2),
Mean_Response_Quality = round(mean(Response_Quality), 2))
if(by == "all"){
bind_rows(total, byday)
} else if(by == "total"){
total
} else if(by == "day"){
byday
} else if(by == "csurvey"){
sim %>%
filter(Survey_Completed==1)
} else {
sim
} %>%
formattable(align = c("c", "c", "c", "c","c","c","c"),
list('FB_Relevance_Score' = formatter("span", style = ~ style(color = "grey",font.weight = "bold")),
'Ad_Cost' = color_tile(customGreen0, customGreen),
'Dollar_Incentive' = color_tile(customGreen0, customGreen),
'Response_Quality' = color_tile(customGreen0, customGreen),
'Survey_Probability' = color_tile(customGreen0, customGreen)))
}
The actual simulation that we'll be using accounts for much greater variability, as well as interactions between the different variables.
theory <- Simulation(days = 1,
daily_impressions=500,
mean_relevance = 5,
sd_relevance = 2,
relevance_filter = 0,
probability_sd = 0.024,
cost_sd_rh = 1.75,
cost_sd_rm = 2,
cost_sd_rl = 2,
dollar_incentive_min = 2,
dollar_incentive_max = 20,
quality_reduction_slope = -1/25,
quality_reduction_intercept = 1.4
)
#Output will be FB_Relevance_Score, Ad_Cost, Dollar_Incentive, Response_Quality, Survey_Probability, Survey_Completed.
library(plotly)
plot_ly(theory, x = ~FB_Relevance_Score, y = ~Response_Quality, z = ~Ad_Cost, color = ~Survey_Probability, type = 'scatter3d', mode = 'markers', marker = list(size = 3))
The first theoretical model is far simpler, but we expect behavior to resemble this latter model. The dots are colored by the probability that the respondent will complete the survey.
Facebook is unlikely to show the ad if the relevance score is low, so we expect only ads with a relevance score greater than 6 to be shown.
To get a better understanding of the relationship between quality and relevance score, we look at the following plot.
ggplot(theory, aes(x = FB_Relevance_Score, y = Response_Quality, color = Survey_Probability)) + geom_point()
We see that there are fewer low quality respondents when relevance score is high. We expect low relevance to yield low quality responses.
KAG3.1 <- KAG %>%
filter(xyz_campaign_id == 3) %>%
mutate(CPM =(Clicks/Impressions)*1000,
SPM =(Spent/Impressions)*1000,
cpc = Spent/Clicks,
apc = Approved_Conversion/Clicks) %>%
mutate(relevance_approx = ifelse(CPM < 39, "r7",
ifelse(CPM >= 39 & CPM < 83, "r8", "r9up")))
histo <- KAG3.1 %>%
ggplot(aes(x=CPM, color = relevance_approx)) +
geom_histogram()
ggplotly(histo)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
We need to determine the impact that Relevance has on Click Through Rate and Impressions, so we look at the previous histogram.
In order to determine how relevance score is associated with the likelihood of completing a survey, I used the previously mentioned data from Kaggle. Relevance score was not included, so I used Clicks per 1,000 Impressions as a measure of interest in the ad. The histogram shows how CPM will be used to approximate relevance: the higher the CPM, the higher the likelihood that the user will take the survey.
I wanted to use Cost per 1,000 Impressions, but those data are too heavily left-skewed. Because the data only show conversions, Cost per 1,000 Impressions will be much lower for observations with low impressions.
library(formattable)
## Warning: package 'formattable' was built under R version 3.5.3
##
## Attaching package: 'formattable'
## The following object is masked from 'package:plotly':
##
## style
KAG3.1 %>%
select(relevance_approx, cpc, apc, SPM) %>%
group_by(relevance_approx) %>%
summarize(Mean_APC = mean(apc),
SD_APC = sd(apc),
Mean_CPM = mean(SPM),
total_observations = n()) %>%
formattable()
| relevance_approx | Mean_APC | SD_APC | Mean_CPM | total_observations |
|---|---|---|---|---|
| r7 | 0.22330038 | 0.28503202 | 506.4588 | 314 |
| r8 | 0.04472960 | 0.02482618 | 711.9548 | 159 |
| r9up | 0.03184836 | 0.02429892 | 757.2053 | 152 |
Mean approved conversions per click (APC) show that the higher the CPM, the lower the mean APC. This is an artifact of the data: since only ads with clicks are included, ads with low impressions are favored. Between r8 and r9up the means differ by only about .01 and the standard deviations are very similar, despite the many observations. I believe the mean APC will be .03 for r9up, .2 for r8, and .1 for r7, with a standard deviation of .024.
We also see that these ad campaigns cost between roughly 5.06 and 7.57 dollars.
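The assumed APC distributions can be sketched by drawing from folded normals. The tier means and the .024 standard deviation come from the beliefs stated above, not from a fitted model:

```r
set.seed(42)

# Assumed mean APC per approximate relevance tier, with a shared sd of .024
apc_means <- c(r7 = 0.1, r8 = 0.2, r9up = 0.03)
apc_sd    <- 0.024

# abs() keeps the simulated rates non-negative, as in the later simulation code
apc_draws <- lapply(apc_means, function(m) abs(rnorm(500, mean = m, sd = apc_sd)))

round(sapply(apc_draws, mean), 3)  # sample means land near the assumed means
```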
In a past study, I read that a $10 incentive increased the likelihood of a person taking the survey. I also read that incentives may have a ceiling: as the incentive increases, the probability of a person taking the survey also increases, but at a diminishing rate. This appears to follow a logarithmic function.
log(10, base = exp(1.3))
## [1] 1.771219
This logarithm uses base e^1.3 rather than the standard base e to meet this assumption.
xexp <- log(10, base = exp(1.3))
log(10, base = exp(xexp))
## [1] 1.3
incentives <- seq(1,20,1)
incentbydollar <- tibble(incentive = log(incentives, base=exp(xexp)),
dollars = incentives,
qualdecr = (-1/25)*incentives+1.4)
ggplot(incentbydollar, aes(x=dollars, y = incentive)) +
geom_point() +
geom_abline(intercept = (1-1/xexp), slope = 1/(exp(xexp)*xexp)) +
geom_hline(yintercept = 1, linetype = "dashed", color = "blue", size = .5) +
geom_vline(xintercept = exp(xexp), linetype = "dashed", color = "blue", size = .5) +
geom_point(aes(x=dollars, y = qualdecr, color = "red"))
This model says that from 0 to 6 dollars, the probability of taking a survey is reduced, but it increases at a fast rate for every dollar given. At about 6 dollars, the probability of taking the survey is multiplied by 1. Every dollar after this increases the probability further, but at a diminishing rate. At 10 dollars, the probability increases by 30%, as the previous study concluded. The tangent line shows the rate of increase in probability where the incentive multiplier equals 1.
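Both anchor points can be checked directly from the definitions: the multiplier is exactly 1 at exp(xexp) ≈ $5.88 and exactly 1.3 (the 30% boost) at $10:

```r
xexp <- log(10, base = exp(1.3))  # ~1.771, as defined earlier

# The incentive multiplier used in the simulation's inc_prob term
incentive_multiplier <- function(dollars) log(dollars, base = exp(xexp))

exp(xexp)                        # the break-even incentive, ~$5.88
incentive_multiplier(exp(xexp))  # 1: probability unchanged
incentive_multiplier(10)         # 1.3: a 30% increase at $10
```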
I also read that as the incentive increases, the quality of the survey responses decreases. A greater incentive attracts respondents who are less motivated by the survey and more by the money; it can also create a feeling of "being bribed." This means that a greater incentive should negatively impact quality. The red dots demonstrate this relationship and will be used as a multiplier against quality.
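The red-dot quality multiplier is just the linear function qualdecr = (-1/25) × incentive + 1.4 from the code above; a few spot checks:

```r
# Quality multiplier as a function of the dollar incentive
# (slope -1/25 and intercept 1.4, matching the simulation defaults)
quality_multiplier <- function(dollars) (-1 / 25) * dollars + 1.4

quality_multiplier(2)   # 1.32: a small incentive barely dents quality
quality_multiplier(10)  # 1.00: quality is unchanged at $10
quality_multiplier(20)  # 0.60: a $20 incentive cuts expected quality by 40%
```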
The scalers are applied, and the probability of completing the survey is randomly generated through the binomial distribution (probability and standard deviation based on the Kaggle data).
relv <- theory %>% filter(FB_Relevance_Score >= 7)
plot_ly(relv, x = ~FB_Relevance_Score, y = ~Response_Quality, z = ~Dollar_Incentive, color = ~Survey_Probability, symbol = ~Survey_Completed, symbols = c("o", "circle"), type = 'scatter3d', mode = 'markers', marker = list(size = 5))
The plot marks respondents who completed a survey with a filled dot. They are plotted by their relevance scores, the expected quality of their responses, and the amount of their incentive. The color shows the probability that they would complete the survey: the lighter the color, the higher the probability.
relv %>% Simulation_Summary(by="csurvey")
## # A tibble: 12 x 7
## Day FB_Relevance_Sc~ Ad_Cost Dollar_Incentive Response_Quality
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 1 9.1 1.46 17 8.43
## 2 1 8.6 3.23 17 6.25
## 3 1 8.5 1.85 17 6.92
## 4 1 8 2.98 2 8.76
## 5 1 7.9 2.19 8 6.08
## 6 1 7.9 6.39 14 7.57
## 7 1 7.4 1.45 8 8.25
## 8 1 7.3 6.02 15 6.96
## 9 1 7.2 6.50 15 6.52
## 10 1 7.2 9.05 7 8.10
## 11 1 7.2 1.45 3 7.16
## 12 1 7.1 2.54 14 6.42
## # ... with 2 more variables: Survey_Probability <dbl>,
## # Survey_Completed <int>
This is the summary of this sample. The results are random at every iteration, but we should expect ad costs to be lower, dollar incentives somewhere above 10, response quality varied, and survey probabilities higher for high relevance scores.
Now that we and Facebook know how our ads work with certain groups, we can begin to target those groups better. This simulation will be better placed, with 2,000 Impressions per day and a normally distributed relevance score with mean 7.2 and sd 0.8.
Day_sim <- Simulation(days = 1,
daily_impressions=2000,
mean_relevance = 7.2,
sd_relevance = .8,
relevance_filter = 6,
probability_sd = 0.024,
cost_sd_rh = 1.75,
cost_sd_rm = 2,
cost_sd_rl = 2,
dollar_incentive_min = 2,
dollar_incentive_max = 20,
quality_reduction_slope = -1/25,
quality_reduction_intercept = 1.4
)
plot_ly(Day_sim, x = ~FB_Relevance_Score, y = ~Response_Quality, z = ~Dollar_Incentive, color = ~Survey_Probability, symbol = ~Survey_Completed, symbols = c("o", "circle"), type = 'scatter3d', mode = 'markers', marker = list(size = 3))
Day_summary = Day_sim %>%
filter(Survey_Completed==1) %>%
summarize(FB_Relevance_Score=mean(FB_Relevance_Score),
Ad_Cost=mean(Ad_Cost),
Dollar_Incentive=mean(Dollar_Incentive),
Response_Quality=mean(Response_Quality),
Survey_Probability=mean(Survey_Probability),
Surveys_Completed = sum(Survey_Completed))
Day_summary %>%
formattable(align = c("c", "c", "c","c","c"),
list('FB_Relevance_Score' = formatter("span", style = ~ style(color = "grey", font.weight = "bold")),
'Ad_Cost' = color_tile(customGreen0, customGreen),
'Dollar_Incentive' = color_tile(customGreen0, customGreen),
'Response_Quality' = color_tile(customGreen0, customGreen),
'Survey_Probability' = color_tile(customGreen0, customGreen)))
| FB_Relevance_Score | Ad_Cost | Dollar_Incentive | Response_Quality | Survey_Probability | Surveys_Completed |
|---|---|---|---|---|---|
| 7.298 | 4.82168 | 11.42 | 6.8581 | 4.6529 | 50 |
Simulation_Summary(Day_sim, by = "csurvey")
## # A tibble: 50 x 7
## Day FB_Relevance_Sc~ Ad_Cost Dollar_Incentive Response_Quality
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 1 9.5 0.173 7 8.52
## 2 1 8.9 0.894 14 8.51
## 3 1 8.6 2.50 16 8.76
## 4 1 8.5 3.97 16 6.40
## 5 1 8.2 2.77 6 9.15
## 6 1 8 0.918 12 5.51
## 7 1 7.9 2.94 17 3.91
## 8 1 7.8 5.20 8 7.88
## 9 1 7.8 7.05 14 4.08
## 10 1 7.7 5.07 13 3.52
## # ... with 40 more rows, and 2 more variables: Survey_Probability <dbl>,
## # Survey_Completed <int>
Month_sim <- Simulation(days = 30,
daily_impressions=2000,
mean_relevance = 7.2,
sd_relevance = .8,
relevance_filter = 6,
probability_sd = 0.024,
cost_sd_rh = 1.75,
cost_sd_rm = 2,
cost_sd_rl = 2,
dollar_incentive_min = 2,
dollar_incentive_max = 20,
quality_reduction_slope = -1/25,
quality_reduction_intercept = 1.4
)
table(Month_sim$Survey_Completed)
##
## 0 1
## 54878 1588
Month_sim %>%
ggplot(aes(x=Dollar_Incentive)) +
geom_histogram(bins=19, color = "white") +
facet_grid(Survey_Completed~., scales = "free")
Month_sim %>%
ggplot(aes(x=Response_Quality)) +
geom_histogram(bins=19, color = "white") +
facet_grid(Survey_Completed~., scales = "free")
Month_sim %>%
ggplot(aes(x= FB_Relevance_Score)) +
geom_histogram(bins=10, color = "white") +
facet_grid(Survey_Completed~., scales = "free")
Month_sim %>%
ggplot(aes(x= Ad_Cost)) +
geom_histogram(bins=20, color = "white") +
facet_grid(Survey_Completed~., scales = "free")
Month_sim %>%
select(-Day, -Survey_Completed) %>%
cor() %>%
formattable()
## FB_Relevance_Score Ad_Cost Dollar_Incentive
## FB_Relevance_Score 1 -0.4536 0.001087
## Ad_Cost -0.4536 1 -0.003953
## Dollar_Incentive 0.001087 -0.003953 1
## Response_Quality 0.302 -0.1361 -0.6709
## Survey_Probability 0.1594 -0.07859 0.3188
## Response_Quality Survey_Probability
## FB_Relevance_Score 0.302 0.1594
## Ad_Cost -0.1361 -0.07859
## Dollar_Incentive -0.6709 0.3188
## Response_Quality 1 -0.1679
## Survey_Probability -0.1679 1
The table shows the number of surveys completed (1,588) versus not completed (54,878), and the histograms show the differences between respondents who completed the survey and those who didn't. The first is the most notable: as the incentive increased, so did the number of respondents. The histograms are not on a common scale, so note the counts on the left side. The correlation matrix shows that none of the variables are strongly related, save for the negative association between Incentive and Quality.
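From the table() output above, the overall completion rate works out to roughly 2.8% of impressions:

```r
# Completion counts from table(Month_sim$Survey_Completed) above
not_completed <- 54878
completed     <- 1588

round(completed / (completed + not_completed) * 100, 2)  # ~2.81% completion rate
```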
library(ggridges)
## Warning: package 'ggridges' was built under R version 3.5.3
##
## Attaching package: 'ggridges'
## The following object is masked from 'package:ggplot2':
##
## scale_discrete_manual
Month_sim <- Simulation()
Month_sim %>%
filter(Survey_Completed==1) %>%
select(Dollar_Incentive, Response_Quality) %>%
arrange(Dollar_Incentive, Response_Quality) %>%
mutate(Dollar_Incentive = as.factor(Dollar_Incentive)) %>%
ggplot(aes(x = Response_Quality, y = Dollar_Incentive, height = ..density..)) +
geom_density_ridges_gradient(aes(fill = ..x..)) +
scale_fill_gradientn(
colours = c("#0D0887FF", "#CC4678FF", "#F0F921FF"),
name = "Quality"
)+
labs(title = 'Response Quality vs Incentive')
## Picking joint bandwidth of 0.639
Here, we get a better understanding of the relationship between Incentive and Quality; the simulation gives these two a negative linear correlation.
summary_Month_sim <- Simulation_Summary(Month_sim)
summary_Month_sim
## # A tibble: 31 x 8
## Day Total_Impressio~ Surveys_Complet~ Mean_Probability Mean_Incentive
## <chr> <int> <int> <dbl> <dbl>
## 1 30 D~ 37375 979 2.58 12.7
## 2 1 1205 37 2.62 11.9
## 3 2 1213 29 2.55 13.9
## 4 3 1217 32 2.56 12.6
## 5 4 1222 31 2.66 14.0
## 6 5 1245 33 2.57 12.5
## 7 6 1244 33 2.6 12.8
## 8 7 1261 30 2.63 12.3
## 9 8 1284 32 2.7 12.7
## 10 9 1261 27 2.53 11.5
## # ... with 21 more rows, and 3 more variables: Total_Incentive_Cost <dbl>,
## # Mean_Relevance <dbl>, Mean_Response_Quality <dbl>
Now we see that we have completed 979 surveys at a total incentive cost of about $12,426. We must find the optimal incentive with regard to the highest probability of completing a survey.
Month_sim %>%
group_by(Dollar_Incentive) %>%
summarize(n_impressions = n(),
n_surveys = sum(Survey_Completed),
survey_cost = sum(ifelse(Survey_Completed==1,Dollar_Incentive,0)),
ad_cost = sum(Ad_Cost/100),
mean_prob = mean(Survey_Probability),
sd_prob = sd(Survey_Probability)) %>%
arrange(desc(Dollar_Incentive))
## # A tibble: 19 x 7
## Dollar_Incentive n_impressions n_surveys survey_cost ad_cost mean_prob
## <dbl> <int> <int> <dbl> <dbl> <dbl>
## 1 20 1075 39 780 44.5 3.41
## 2 19 2104 75 1425 86.3 3.42
## 3 18 2020 77 1386 83.8 3.34
## 4 17 2118 80 1360 86.8 3.16
## 5 16 2040 60 960 83.3 3.18
## 6 15 2156 79 1185 88.6 2.98
## 7 14 2061 72 1008 84.6 3.04
## 8 13 2128 62 806 88.2 2.92
## 9 12 2046 63 756 85.0 2.86
## 10 11 2060 40 440 83.7 2.71
## 11 10 2099 52 520 87.3 2.66
## 12 9 2055 49 441 83.5 2.56
## 13 8 2110 55 440 85.5 2.40
## 14 7 1988 36 252 83.3 2.24
## 15 6 2118 48 288 86.6 2.10
## 16 5 2072 36 180 84.8 1.85
## 17 4 2094 36 144 87.4 1.61
## 18 3 2034 15 45 84.4 1.25
## 19 2 997 5 10 40.4 0.772
## # ... with 1 more variable: sd_prob <dbl>
This gives us a better look at how expensive these incentives are, and how they are associated with ad cost and the rate at which the surveys were completed.
Month_sim %>%
select(Dollar_Incentive, Survey_Probability) %>%
arrange(Dollar_Incentive, Survey_Probability) %>%
mutate(Dollar_Incentive = as.factor(Dollar_Incentive)) %>%
ggplot(aes(x = Survey_Probability, y = Dollar_Incentive, height = ..density..)) +
geom_density_ridges_gradient(aes(fill = ..x..)) +
scale_fill_gradientn(
colours = c("#0D0887FF", "#CC4678FF", "#F0F921FF"),
name = "Probability"
)+
labs(title = 'Survey Probability vs Incentive')
## Picking joint bandwidth of 0.378
Here, we see that as the Incentive increases, so does the probability that a respondent will complete a survey.
NOTE Dollar Incentive can be relative to a respondent group’s median hourly income, and the scale could be a z-score from this mean. This would mean that the simulation can be used to represent all groups, assuming that each group behaves similarly.
The optimal incentive minimizes the incentive amount while maximizing the probability of taking the survey.
This first optimization uses the parameters without the randomness included.
incentive = seq(2,20,1)
inc_func_theoretical <- function(incentive) {
mean_prob = seq(.001,.03, .001)
inc_by_prob = (mean_prob * log(incentive, base = exp(xexp)))
return(max(inc_by_prob))
}
plot(optim(incentive, inc_func_theoretical)$par)
This suggests that 16 is the lowest incentive with the highest rate of survey completion.
mean_probs_emp <- Month_sim %>%
group_by(Dollar_Incentive) %>%
summarize(mean_prob = mean(Survey_Probability)) %>%
arrange((Dollar_Incentive)) %>%
select(mean_prob)
inc_func_emp <- function(incentive) {
inc_by_prob = (mean_probs_emp * log(incentive, base = exp(xexp)))
return(max(inc_by_prob))
}
inc_prob <- sapply(incentive, inc_func_emp)
plot(incentive, optim(incentive, inc_func_emp)$par)
However, the empirical distribution suggests that 15 is the optimal price point.
incentive = seq(2,20,1)
incentive_b_dollar <- tibble(Probability_of_Survey =inc_prob,
Incentive = incentive)
optimal_price <- (which.max(optim(incentive, inc_func_emp)$par)+1)
ggplot(incentive_b_dollar, aes(x=Incentive, y = Probability_of_Survey)) +
geom_point() +
geom_hline(yintercept = incentive_b_dollar$Probability_of_Survey[optimal_price-1], linetype = "dashed", color = "red", size = .5) +
geom_vline(xintercept = (optimal_price), linetype = "dashed", color = "red", size = .5)+
geom_hline(yintercept = incentive_b_dollar$Probability_of_Survey[which(incentive_b_dollar[,"Incentive"]==10)], linetype = "dashed", color = "blue", size = .5) +
geom_vline(xintercept = 10, linetype = "dashed", color = "blue", size = .5)
The red crosshairs represent the minimum incentive point in which the probability of taking a survey is maximized. The blue crosshairs mark where we believed the optimal point would be.
OP_Sim <- Simulation(dollar_incentive_min = optimal_price,
dollar_incentive_max = optimal_price)
Simulation_Summary(OP_Sim, by = "total")
## # A tibble: 1 x 8
## Day Total_Impressio~ Surveys_Complet~ Mean_Probability Mean_Incentive
## <chr> <int> <int> <dbl> <dbl>
## 1 30 D~ 37651 1187 3.13 15
## # ... with 3 more variables: Total_Incentive_Cost <dbl>,
## # Mean_Relevance <dbl>, Mean_Response_Quality <dbl>
# SIMULATION PARAMETERS
# Simulation(days = 30,
# daily_impressions=2000,
# mean_relevance = 7.2,
# sd_relevance = .8,
# relevance_filter = 7,
# probability_sd = 0.024,
# cost_sd_rh = 1.75,
# cost_sd_rm = 2,
# cost_sd_rl = 2,
# dollar_incentive_min = 2,
# dollar_incentive_max = 20,
# quality_reduction_slope = -1/25,
# quality_reduction_intercept = 1.4
# )
month <- tibble()
Year <- list()
number_simulations <- 100
Optimal_Incentives <- as.numeric(vector(length = number_simulations))
for(i in 1:number_simulations){
month <- Simulation()
mean_probs_emp <- month %>%
group_by(Dollar_Incentive) %>%
summarize(mean_prob = mean(Survey_Probability)) %>%
arrange((Dollar_Incentive)) %>%
select(mean_prob)
Year[[i]] <- month %>% Simulation_Summary("total")
inc_func_emp <- function(incentive) {
inc_by_prob = (mean_probs_emp * log(incentive, base = exp(xexp)))
return(max(inc_by_prob))
}
Optimal_Incentives[i] <- which.max(optim(incentive, inc_func_emp)$par)+1
}
OPI <- as.data.frame(table(Optimal_Incentives))
opt <- OPI[which.max(OPI$Freq),"Optimal_Incentives"]
opt <- as.numeric(as.character(opt))
ggplot(as.data.frame(Optimal_Incentives), aes(x=Optimal_Incentives)) +
geom_histogram(aes(y=..density..), colour="black", fill="white", bins = nrow(OPI))+
geom_density(alpha=.2, fill="#FF6666") +
geom_vline(xintercept = opt, color = "blue")
The maximum likelihood estimate of the optimal point is 13.